Overview of Dataset

The dataset was obtained from Kaggle. It has 299 observations and 13 variables. The outcome variable ‘DEATH_EVENT’ indicates whether a patient died of heart failure, and 11 of the remaining variables serve as predictors. The variable names are shown below:

NB: The 12th variable, ‘time’, indicates the time from the start of the study until the subject's follow-up ended. This, presumably, could be because the subject was declared healthy, dropped out of the study for various reasons, or died from heart failure. That time would not be available in real-world use, when the resulting model is asked to predict the outcome of a new case, so including it would cause target leakage. The ‘time’ variable is therefore not used as a feature to train the model.

## Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
##        'ejection_fraction', 'high_blood_pressure', 'platelets',
##        'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
##        'DEATH_EVENT'],
##       dtype='object')
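
As a minimal sketch of the leakage precaution described in the note above, the predictors can be separated from the outcome while leaving ‘time’ out. The helper name `split_features_target` and the toy rows are hypothetical and only illustrate the idea; they are not taken from the dataset, and the frame includes only a few of the columns.

```python
import pandas as pd


def split_features_target(df, target="DEATH_EVENT", leaky=("time",)):
    """Return (X, y), excluding the target and any leakage-prone columns from X."""
    X = df.drop(columns=[target, *leaky])
    y = df[target]
    return X, y


# Toy rows for illustration only (made-up values, NOT from the Kaggle data).
demo = pd.DataFrame(
    {
        "age": [55.0, 68.0, 72.0],
        "ejection_fraction": [38, 25, 30],
        "serum_creatinine": [1.1, 1.9, 1.3],
        "time": [120, 14, 200],
        "DEATH_EVENT": [0, 1, 0],
    }
)

X, y = split_features_target(demo)
print(list(X.columns))
```

With the full dataset, the same call would leave the 11 predictors in `X` and the outcome in `y`.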

Brief Exploratory Data Analysis

We can take a quick look at the only two demographic variables in the dataset: age and sex. From the output below, we see that the patients' ages range from 40 to 95 years, with a median of 60 years and a mean of approximately 60.8 years.

## count    299.000000
## mean      60.833893
## std       11.894809
## min       40.000000
## 25%       51.000000
## 50%       60.000000
## 75%       70.000000
## max       95.000000
## Name: age, dtype: float64
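
A summary like the one above is the kind of output pandas produces with `Series.describe()`. The sketch below assumes the data sit in a DataFrame with ‘age’ and ‘sex’ columns; the helper `summarize_demographics` and the toy values are hypothetical and are not the dataset's actual rows.

```python
import pandas as pd


def summarize_demographics(df):
    """Return descriptive statistics for age and category counts for sex."""
    age_summary = df["age"].describe()    # count, mean, std, min, quartiles, max
    sex_counts = df["sex"].value_counts() # number of rows per sex code
    return age_summary, sex_counts


# Toy values for illustration only (NOT the real data).
toy = pd.DataFrame(
    {
        "age": [40.0, 51.0, 60.0, 70.0, 95.0],
        "sex": [1, 0, 1, 1, 0],
    }
)

age_summary, sex_counts = summarize_demographics(toy)
print(age_summary)
print(sex_counts)
```

Comparing the `50%` row (the median) with the `mean` row is a quick way to check for skew in the age distribution.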